Supplementary material for The Human Genome Contracts Again
Abstract
Genome compression has been the subject of multiple studies in the past several years. In general, studies in this field can be grouped into two main categories: reference-based compression and non-reference-based compression. Methods that do not use reference genomes exploit the repetitive nature of genomic sequences and compress the input data using variations on general-purpose compression schemes, such as LZ77 [Lempel & Ziv 1977], modified to account for features specific to genomic data, such as inversions, palindromes, and approximate repeats. There are many studies in this category, including [Grumbach & Tahi 1994, Powell et al. 1998, Gusev et al. 1999, Apostolico & Lonardi 2000, Matsumoto et al. 2000, Adjeroh et al. 2002, Chen et al. 2002, Tabus et al. 2003, Manzini & Rastero 2004, Behzadi & Le Fessant 2005, Korodi & Tabus 2005, Srinivasa et al. 2006, Cao et al. 2007, Korodi & Tabus 2007, Pinho et al. 2011, Kreft & Navarro 2011]. For a review, see [Giancarlo et al. 2009]. The best compression obtained with such approaches is about 600 MB per genome (see [Pinho et al. 2011] for a comparison of several methods).
Methods in the second category (reference-based) obtain much better compression by exploiting the fact that the genomes of different individuals are more than 99.8% identical, coding only the differences between the input genome and a reference genome. Studies in this category include [Christley et al. 2009, Brandon et al. 2009, Deorowicz & Grabowski 2011b, Kuruppu et al. 2011, Wang & Zhang 2011, Hsi-Yang Fritz et al. 2011, Pinho et al. 2012, Chern et al. 2012]. These methods differ in how the reference sequence is selected or computed from a set of genomes, and in how the difference map is computed and compressed (usually with LZ-inspired schemes). Reference-based methods can reportedly reduce the size of the compressed genome to between 3.1 MB and 19 MB.
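To make the idea of a difference map concrete, the following is a minimal sketch in Python. It assumes the simplest possible case: the input genome and the reference have the same length and differ only by single-base substitutions. The function names `encode_substitutions` and `decode_substitutions` and the toy sequences are hypothetical and are not taken from any of the cited methods, which also handle insertions and deletions and compress the map further.

```python
from typing import List, Tuple

# A difference map here is a list of (position, base in the input genome).
Diff = List[Tuple[int, str]]


def encode_substitutions(reference: str, target: str) -> Diff:
    """Record only the positions where target differs from reference.

    Assumption of this sketch: substitutions only, so lengths must match.
    """
    if len(reference) != len(target):
        raise ValueError("substitution-only sketch: lengths must match")
    return [(i, t) for i, (r, t) in enumerate(zip(reference, target)) if r != t]


def decode_substitutions(reference: str, diff: Diff) -> str:
    """Rebuild the input genome by applying the difference map to the reference."""
    bases = list(reference)
    for pos, base in diff:
        bases[pos] = base
    return "".join(bases)


# Toy sequences (hypothetical data, far shorter than a real genome).
reference = "ACGTACGTACGTACGT"
target = "ACGTACCTACGTACGA"

diff = encode_substitutions(reference, target)
print(diff)  # [(6, 'C'), (15, 'A')]
assert decode_substitutions(reference, diff) == target
```

Only two (position, base) pairs need to be stored for the sixteen-base toy input. The same effect at genome scale is what lets reference-based schemes store far less than the full sequence.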
In general, better results may be obtained when multiple sequences are compressed simultaneously, or when multiple reference genomes are used (note that the two tasks are closely related), as discussed in [Mantaci et al. 2005, Mäkinen et al. 2010, Kuruppu et al. 2011, Deorowicz & Grabowski 2011b, Kuruppu et al. 2012]. The best results we are aware of among schemes that simultaneously compress multiple genomes are reported in [Deorowicz & Grabowski 2011b], who compress 70 complete human genomes to an average size of 3.1 MB per genome. The best compression using a single reference genome thus far was reported in [Christley et al. 2009], who compressed James Watson's genome to 4 MB by using dbSNP to represent known SNPs in the difference map more efficiently. Another class of relevant studies focuses on compression of short reads [Tembe et al. 2010, Deorowicz & Grabowski 2011a, Hsi-Yang Fritz et al. 2011]. One of these schemes [Hsi-Yang Fritz et al. 2011] uses a reference genome against
Publication date: 2013